MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Authors

Microsoft

Links:

[2407.02490] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | OpenReview

Abstract:

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at this https URL.

Summary

Attention is highly sparse, and the sparsity distribution of the attention computation changes with the input. Some prior work therefore estimates this sparsity dynamically, but such dynamic estimation introduces substantial extra computational overhead. The core goal of this work is to minimize that overhead.

Through experiments, this work finds that attention computation across different heads generally follows one of three patterns: the A-shape pattern, the Vertical-Slash pattern, and the Block-Sparse pattern.

For performance, MInference introduces a kernel-aware optimal sparse pattern search; it also introduces three different methods for building the dynamic sparse mask, together with dedicated GPU kernels for each of the three mask types.

As a result, it reduces computation time while maintaining high accuracy.

[Figure: image-20250123221855813]

Motivation

Attention exhibits dynamic sparsity

[Figure: image-20250123224424864]

Dynamic nature: when the Top-4k positions obtained from the prompt in figure (b) are applied to a different prompt (figure (c)), the recalled attention score drops to 83.7%.

This dynamic behavior has also been established mathematically in the following two works.

Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. ArXiv preprint, abs/2106.03764, 2021.

Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive flexibility of self-attention matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8773–8781, 2023.

Attention sparsity follows certain spatial patterns

[Figure: image-20250124112337725]

[Figure: image-20250124112600620]

  • A-shape pattern

    • Two windows: the initial tokens of the sequence plus a local sliding window

      [HWP+24] Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991–4008, Mexico City, Mexico, 2024. Association for Computational Linguistics.

      [XTC+24] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.

  • Vertical-Slash (VS) pattern

    • A few critical key tokens (vertical lines) plus tokens at fixed relative distances from the query (slash lines)

      [MJ23] Amirkeivan Mohtashami and Martin Jaggi. Random-access infinite context length for transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  • Block-Sparse pattern

    • More irregular, organized as semantically coherent blocks.
    • Figure (b) illustrates this block structure.
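
To make the three layouts concrete, here is a minimal NumPy sketch that builds boolean masks for each pattern on a causal attention grid. The window sizes, line indices, and block lists below are illustrative assumptions; MInference realizes these patterns inside optimized GPU kernels rather than as dense masks.

```python
# Minimal sketch of the three sparse layouts (illustration only).
import numpy as np

def a_shape_mask(n, init_tokens=64, local_window=512):
    """A-shape: attend to the first few tokens plus a local causal window."""
    i = np.arange(n)[:, None]   # query index
    j = np.arange(n)[None, :]   # key index
    causal = j <= i
    return causal & ((j < init_tokens) | (i - j < local_window))

def vertical_slash_mask(n, vertical_idx, slash_offsets):
    """Vertical-Slash: a few key columns plus diagonals at fixed offsets."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    vertical = np.isin(j, vertical_idx)     # important key columns
    slash = np.isin(i - j, slash_offsets)   # fixed query-key distances
    return causal & (vertical | slash)

def block_sparse_mask(n, block_size, active_blocks):
    """Block-Sparse: only selected (query-block, key-block) pairs are kept."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.zeros((n, n), dtype=bool)
    for qb, kb in active_blocks:
        mask[qb * block_size:(qb + 1) * block_size,
             kb * block_size:(kb + 1) * block_size] = True
    return mask & (j <= i)

# Toy usage with made-up indices on a 1,024-token sequence.
m_vs = vertical_slash_mask(1024, vertical_idx=[0, 3, 17], slash_offsets=[0, 1, 128])
m_bs = block_sparse_mask(1024, block_size=64, active_blocks=[(0, 0), (5, 3), (9, 9)])
```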

Innovations and Contributions

Design Details

  1. Problem Formulation

    In the sparse setting, the attention computation can be rewritten in the following form, where c is a very large constant and M is the dynamic sparse attention mask; in this way, entries outside the mask become approximately 0 after the Softmax.

    [Equation image: image-20250206192726275]
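
    As a rough reconstruction (the notation here is assumed and may differ from the paper's exact equation), the masked form can be written as:

    ```latex
    % Hedged reconstruction of the masked attention described above:
    % entries with M_{ij} = 0 receive a -c bias and vanish after the Softmax.
    A = \mathrm{Softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} - c \,(\mathbf{1} - M) \right),
    \qquad y = A\, V
    ```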

    This leads to the core requirement of dynamic sparse attention: minimize the computational overhead while preserving the attention output as faithfully as possible.

    [Equation image: image-20250206192752132]
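
    One hedged way to formalize this trade-off (the paper's exact formulation may differ) is as a per-head constrained problem:

    ```latex
    % Sketch of the per-head objective: stay close to dense attention under a
    % compute budget. The norm choice and the budget B are assumptions.
    \min_{M \in \mathcal{M}} \;
      \bigl\lVert \mathrm{Attn}(Q, K, V; M) - \mathrm{Attn}(Q, K, V) \bigr\rVert
    \quad \text{s.t.} \quad \mathrm{FLOPs}(M) \le B
    ```

    where \mathcal{M} ranges over the A-shape, Vertical-Slash, and Block-Sparse mask families and B is the per-head FLOPs budget.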

  2. Performance Optimization

    1. Kernel-Aware Optimal Sparse Pattern Search

      First, a search space of candidate configurations (patterns, settings) is built according to the target FLOPs budget; then, within this reduced search space, the optimal configuration is selected offline for each attention head individually.

      [Figure: image-20250206211324749]
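
      The sketch below illustrates this offline step for a single head, assuming a simple FLOPs cost model and a recall-style scoring function; both are hypothetical placeholders rather than the paper's actual kernel-aware cost model.

      ```python
      # Illustrative offline, kernel-aware pattern search for one head.
      from itertools import product

      def flops_of(pattern, cfg, seq_len):
          """Rough per-head cost model for each (pattern, config) pair (placeholder)."""
          if pattern == "a_shape":
              return seq_len * (cfg["init"] + cfg["local"])
          if pattern == "vertical_slash":
              return seq_len * (cfg["n_vertical"] + cfg["n_slash"])
          if pattern == "block_sparse":
              return cfg["top_k_blocks"] * cfg["block"] ** 2
          raise ValueError(pattern)

      def search_head_config(head_qkv, candidates, budget, seq_len, score_fn):
          """Return the admissible (pattern, config) that best matches dense
          attention on a calibration example, judged by score_fn."""
          best, best_score = None, float("-inf")
          for pattern, cfg in candidates:
              if flops_of(pattern, cfg, seq_len) > budget:
                  continue  # kernel-aware: skip configs that exceed the FLOPs budget
              score = score_fn(head_qkv, pattern, cfg)  # e.g. attention-score recall
              if score > best_score:
                  best, best_score = (pattern, cfg), score
          return best

      # Toy candidate space; the concrete settings here are made up.
      candidates = (
          [("a_shape", {"init": i, "local": w}) for i, w in product([64, 128], [512, 1024])]
          + [("vertical_slash", {"n_vertical": v, "n_slash": s})
             for v, s in product([500, 1000], [2000, 4000])]
          + [("block_sparse", {"top_k_blocks": k, "block": 64}) for k in [50, 100]]
      )
      ```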

    2. Sparsity Indices Approximation and Dynamic Sparse Attention Calculation

      Another key overhead in dynamic attention computation is constructing the mask, i.e., the sparse indices.

      For Vertical-Slash heads, the product of Q and K (computed on a small subset of queries) is used to pick the top-k critical vertical lines and slash lines; for Block-Sparse heads, Q and K are pooled into blocks, block-level products are used to cheaply approximate attention, and the top-k most important blocks are selected.

      [Figure: image-20250206213621144]
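
      A hedged NumPy sketch of these two estimators is given below; using only the last queries for the Vertical-Slash estimate and mean-pooled blocks for the Block-Sparse estimate follows the description above, while the concrete sizes (last_q, the top-k values, the block size) are illustrative.

      ```python
      # Sketch of the two index-estimation steps (shapes and sizes are assumptions).
      import numpy as np

      def softmax(x, axis=-1):
          x = x - x.max(axis=axis, keepdims=True)
          e = np.exp(x)
          return e / e.sum(axis=axis, keepdims=True)

      def vertical_slash_indices(q, k, last_q=64, top_vertical=1000, top_slash=2000):
          """Estimate important vertical columns and slash diagonals from a cheap
          Q[-last_q:] @ K^T product instead of the full attention matrix."""
          n, d = q.shape
          row = np.arange(n - last_q, n)[:, None]              # query positions
          col = np.arange(n)[None, :]                          # key positions
          scores = q[-last_q:] @ k.T / np.sqrt(d)
          a = softmax(np.where(col <= row, scores, -np.inf))   # causal, (last_q, n)
          vertical_score = a.sum(axis=0)                       # mass per key column
          offsets = (row - col).clip(min=0)                    # query-key distance
          slash_score = np.bincount(offsets.ravel(), weights=a.ravel(), minlength=n)
          return np.argsort(vertical_score)[-top_vertical:], np.argsort(slash_score)[-top_slash:]

      def block_sparse_indices(q, k, block=64, top_k_blocks=100):
          """Mean-pool Q and K into blocks and keep the top-k block pairs
          according to a cheap block-level score."""
          n, d = q.shape
          n_blocks = n // block
          qb = q[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
          kb = k[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
          block_scores = qb @ kb.T / np.sqrt(d)                # (n_blocks, n_blocks)
          flat = np.argsort(block_scores, axis=None)[-top_k_blocks:]
          return np.stack(np.unravel_index(flat, block_scores.shape), axis=1)  # (k, 2)
      ```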
